36 research outputs found
Char-RNN and Active Learning for Hashtag Segmentation
We explore the abilities of a character-level recurrent neural network (char-RNN) for
hashtag segmentation. Our approach to the task is the following: we generate a
synthetic training dataset from frequent n-grams that satisfy predefined
morpho-syntactic patterns, avoiding any manual annotation. An active learning
strategy limits the training dataset and selects an informative training
subset. The approach does not require any language-specific settings and is
compared for two languages, which differ in inflection degree.
Comment: to appear in Cicling201
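Hashtag segmentation is commonly cast as per-character boundary labeling, which is how a char-RNN would consume synthetic training examples. The sketch below is illustrative only (the function names and label scheme are assumptions, not the paper's exact setup): it converts a segmented hashtag into a character sequence with binary labels (1 = a new word starts here) and back.

```python
def boundary_labels(words):
    """Convert a segmented hashtag (list of words) into the concatenated
    character string and per-character labels: 1 where a word begins, else 0."""
    text = "".join(words)
    labels = []
    for w in words:
        labels.append(1)              # first character of each word opens a segment
        labels.extend([0] * (len(w) - 1))
    return text, labels

def segment(text, labels):
    """Inverse operation: recover the word list from characters and labels."""
    words, cur = [], ""
    for ch, lab in zip(text, labels):
        if lab == 1 and cur:
            words.append(cur)
            cur = ""
        cur += ch
    if cur:
        words.append(cur)
    return words
```

Under this encoding, a char-RNN predicts one label per character, and decoding is a single pass over its outputs.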
Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF
In this paper we tackle the multilingual named entity recognition task. We use
the BERT language model as embeddings with a bidirectional recurrent network,
attention, and NCRF on top. We apply multilingual BERT only as an embedder,
without any fine-tuning. We test our model on the dataset of the BSNLP shared
task, which consists of texts in the Bulgarian, Czech, Polish, and Russian
languages.
Comment: BSNLP Shared Task 2019 paper. arXiv admin note: text overlap with arXiv:1806.05626 by other author
Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
Instruction-tuning has become an integral part of training pipelines for
Large Language Models (LLMs) and has been shown to yield strong performance
gains. In an orthogonal line of research, Annotation Error Detection (AED) has
emerged as a tool for detecting quality issues of gold-standard labels. But so
far, the application of AED methods is limited to discriminative settings. It
is an open question how well AED methods generalize to generative settings
which are becoming widespread via generative LLMs. In this work, we present the
first benchmark for AED on instruction-tuning data: Donkii. It
encompasses three instruction-tuning datasets enriched with annotations by
experts and semi-automatic methods. We find that all three datasets contain
clear-cut errors that sometimes propagate directly into instruction-tuned LLMs.
We propose four AED baselines for the generative setting and evaluate them
comprehensively on the newly introduced benchmark. Our results show that
choosing the right AED method and model size is crucial, and we derive
practical recommendations from these findings. To gain further insight, we
present a first case study examining how the quality of instruction-tuning
datasets influences downstream performance.
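A common starting point for error detection in a generative setting, and one plausible shape for such a baseline (illustrative only; this is not necessarily one of the paper's four baselines), is to score each instruction-response pair by the model's per-example loss and flag statistical outliers, since targets the model finds unusually hard to reproduce often indicate annotation errors:

```python
import numpy as np

def flag_suspicious(losses, z_threshold=2.0):
    """Return indices of examples whose per-example loss is an outlier
    (z-score above the threshold); candidates for manual inspection."""
    losses = np.asarray(losses, dtype=float)
    z = (losses - losses.mean()) / losses.std()
    return np.flatnonzero(z > z_threshold)
```

In practice the losses would come from a forward pass of the instruction-tuned model over its own training set; the flagged indices then form a ranked audit queue rather than automatic deletions.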
Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection
Real-life applications that rely heavily on machine learning, such as dialog
systems, demand out-of-domain detection methods. Intent classification models
should be equipped with a mechanism to distinguish seen intents from unseen
ones, so that the dialog agent can reject the latter and avoid undesired
behavior. However, despite increasing attention paid to the task, best
practices for out-of-domain intent detection have not yet been fully
established.
This paper conducts a thorough comparison of out-of-domain intent detection
methods. We prioritize methods that do not require access to out-of-domain data
during training, since gathering such data is extremely time- and
labor-consuming due to the lexical and stylistic variation of user utterances.
We evaluate multiple contextual encoders and methods proven to be efficient on
three standard intent classification datasets expanded with out-of-domain
utterances. Our main findings show that fine-tuning Transformer-based encoders
on in-domain data leads to superior results. The Mahalanobis distance, applied
to utterance representations derived from Transformer-based encoders,
outperforms other methods by a wide margin and establishes new
state-of-the-art results on all datasets.
The broader analysis shows that the reason for this success is that the
fine-tuned Transformer constructs homogeneous representations of in-domain
utterances, which are geometrically separated from out-of-domain utterances.
The Mahalanobis distance, in turn, captures this disparity easily.
Comment: to appear in AAAI 202
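The Mahalanobis-distance detector described above can be sketched concretely: fit per-intent class means and a shared covariance matrix on in-domain encoder embeddings, then score a new utterance by its minimum Mahalanobis distance to any class mean, treating high scores as out-of-domain. This is a minimal numpy sketch of the general technique, not the paper's exact implementation; the covariance estimator and threshold choice are assumptions.

```python
import numpy as np

def fit_mahalanobis(embeddings, labels):
    """Fit per-class means and a shared precision matrix on in-domain embeddings."""
    classes = np.unique(labels)
    means = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    # Shared covariance: pool the class-centered vectors across all classes.
    centered = np.vstack([embeddings[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(embeddings)
    precision = np.linalg.pinv(cov)   # pseudo-inverse for numerical stability
    return means, precision

def ood_score(x, means, precision):
    """Minimum squared Mahalanobis distance to any in-domain class mean.
    Higher scores suggest the utterance is out-of-domain."""
    return min(float((x - m) @ precision @ (x - m)) for m in means.values())
```

In deployment, embeddings would come from the fine-tuned Transformer encoder, and a score threshold calibrated on held-out in-domain data would decide when the agent rejects an utterance.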
DaNetQA: a yes/no Question Answering Dataset for the Russian Language
DaNetQA, a new question-answering corpus, follows the design of (Clark et al.,
2019): it comprises natural yes/no questions. Each question is paired with a
paragraph from Wikipedia and an answer derived from the paragraph. The task is
to take both the question and the paragraph as input and come up with a yes/no
answer, i.e. to produce a binary output. In this paper, we present a
reproducible approach to DaNetQA creation and investigate transfer learning
methods for task and language transfer. For task transfer we leverage three
similar sentence modelling tasks: 1) a corpus of paraphrases, Paraphraser;
2) an NLI task, for which we use the Russian part of XNLI; 3) another question
answering task, SberQUAD. For language transfer we use English-to-Russian
translation together with multilingual language fine-tuning.
Comment: Analysis of Images, Social Networks and Texts - 9th International
Conference, AIST 2020, Skolkovo, Russia, October 15-16, 2020, Revised
Selected Papers. Lecture Notes in Computer Science
(https://dblp.org/db/series/lncs/index.html), Springer 202